GNU Info File | 1992-08-23 | 45KB | 885 lines

This is Info file recode.info, produced by Makeinfo-1.47 from the input
file recode.texi.

Copyright (C) 1990 Free Software Foundation, Inc.
Francois Pinard <pinard@iro.umontreal.ca>, 1988.

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2, or (at your option) any
later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
675 Mass Ave, Cambridge, MA 02139, USA.


File: recode.info,  Node: Top,  Next: Usage,  Prev: (dir),  Up: (dir)

Conversion of files between different charsets and usages
*********************************************************

This `recode' program converts files between various character sets
and usages.  When exact transliterations are not possible, as is often
the case, the program may get rid of the offending characters or fall
back on approximations.

Let us coin the term "charset" to represent, without distinction, a
character set "per se" or a particular usage of a character set.  This
program recognizes or produces a little more than a dozen such
charsets.  Since it can convert each charset to almost any other one,
more than one hundred different conversions are possible.

This tool pays special attention to superimposition of diacritics,
particularly for French representation.  This orientation is mostly
historical; it does not impair the usefulness, generality or
extensibility of the program.
In fact, this program evolved over several years, through several
programming languages and computer brands, because I used many
different codings for French characters on different machines, each
system having its own peculiarities.

You may find in this document:

* Menu:

* Usage::                       How to use this program
* Charsets::                    Character sets recognized or produced
* Easy French::                 Easy French conventions
* Internals::                   Internal aspects
* Future::                      Future things

 -- The Detailed Node Listing --

Character sets recognized or produced

* applemac::                    ASCII 8-bits for Apple's Macintosh
* ascii::                       ASCII 7-bits, BS to overstrike
* bangbang::                    ASCII "bang bang", escapes are ! and !!
* cccascii::                    ASCII 8-bits as seen by Perkin Elmer
* cdcascii::                    ASCII 8-bits as seen by Control Data
* cdcnos::                      ASCII 6/12 from NOS, escapes are ^ and @
* ebcdic::                      EBCDIC with no further comments
* flat::                        ASCII without diacritics nor underline
* ibmpc::                       ASCII 8-bits for IBM's PC
* iconqnx::                     ASCII as coded on Unisys' ICON
* latex::                       ASCII with LaTeX codes
* latin1::                      ASCII extended by Latin Alphabet 1
* texte::                       ASCII with easy French conventions

ASCII 7-bits, BS to overstrike

* Commented ASCII::
* Octal ASCII::
* Decimal ASCII::
* Hexadecimal ASCII::

ASCII "bang bang", escapes are ! and !!

* Display Code::                Control Data's Display Code

ASCII extended by Latin Alphabet 1

* Commented Latin-1::
* Octal Latin-1::
* Decimal Latin-1::
* Hexadecimal Latin-1::

Easy French conventions

* French quotes::               How to type them.
* Latin ligatures::             They are not representable.
* Diacritics::                  How to type them, things to know.
* Ending diaeresis::            List of words ending with diaeresis.
* Easy French History::         When, How and Who.

Internal aspects

* Main flow::                   Overall organization of the program.
* Piping::                      Distinction between internal or external piping.
* Limitations::                 A few limitations of the chosen implementation.
* New charsets::                How to proceed in adding new charsets.


File: recode.info,  Node: Usage,  Next: Charsets,  Prev: Top,  Up: Top

How to use this program
=======================

The general format of the program call is:

     recode [OPTION]... [BEFORE]:[AFTER] [FILE]...

Each FILE is read on the assumption that it is coded with charset
BEFORE, and is recoded over itself so as to use charset AFTER.  If no
FILE is given, the program acts as a filter instead and recodes
standard input to standard output.

The available options are:

     Given this option, all other parameters and options are ignored.
     The program briefly prints the Copyright and copying conditions.
     See the file `COPYING' in the distribution for the full statement
     of the Copyright and copying conditions.

     With Easy French conventions, use the colon `:' instead of the
     double-quote `"' for marking diaeresis.  See *Note Easy French::.

     This option is recognized, but otherwise ignored.  Eventually,
     this option will be necessary for a file to be replaced by its
     recoded contents when the recoding is found not to be fully
     reversible.  In this version, the replacement is done
     unconditionally.

     When the recoding requires a combination of two or more elementary
     recoding steps, this option forces several passes over the data,
     using intermediate files between passes.  This is the default
     behaviour when files are recoded over themselves.  If this option
     is selected in filter mode, that is, when the program reads
     standard input and writes standard output, programs further down
     the pipe chain might have to wait longer before they start
     receiving recoded data.

     When the recoding requires a combination of two or more elementary
     recoding steps, this option forces the creation of a chain of
     program instances initiated through the `popen(3)' library call,
     all operating in parallel.  In filter mode, at the cost of some
     overhead, recoded data becomes available soon after the program
     starts, even if many elementary recoding steps are required.
     If, at installation time, the `popen(3)' call is said to be
     unavailable, selecting option `-o' is equivalent to selecting
     option `-i'.

     When the recoding requires a combination of two or more elementary
     recoding steps, this option forces the program to fork itself into
     a few copies interconnected with pipes, using the `pipe(2)' system
     call.  All copies of the program operate in parallel.  This method
     is similar to the one used through option `-o', but is slightly
     more efficient.  This is the default behaviour in filter mode.  If
     this option is used when files are recoded over themselves, it
     should save some disk accesses and some disk space, at the cost of
     more system overhead.

     If, at installation time, the `pipe(2)' call is said to be
     unavailable, selecting option `-p' is equivalent to selecting
     option `-o'.  If both `pipe(2)' and `popen(3)' are unavailable,
     selecting option `-p' is equivalent to selecting option `-i'.

     The *touch* option is meaningful only when files are recoded over
     themselves.  Without it, the timestamps associated with files are
     preserved, to reflect the fact that changing the code of a file
     does not really alter its informational contents.  When the user
     wants the recoded files to be timestamped at recoding time, this
     option inhibits the automatic protection of the timestamps.

     Before proceeding, the program prints on `stderr' the list and
     order of application of the elementary conversions planned to
     achieve the global conversion.  Then, the program prints on
     `stderr' one message per FILE recoded, to keep the user informed
     of the progress of the command.

One or both of the BEFORE or AFTER keywords may be omitted, but the
colon which separates them cannot.  An omitted keyword implies the
usual or default code in use on the system where this program is
installed.  Usually, this default code is `latin1' for UNIX systems or
`ibmpc' for MS-DOS machines, but it may be changed to any other
supported code when this program is installed.

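The rule for the BEFORE:AFTER argument can be illustrated by a short
Python sketch.  It is not taken from the `recode' sources: the function
name `parse_request' is hypothetical, and `latin1' is used as the
default only because the text above names it as the usual default on
UNIX systems.

```python
from typing import Tuple


def parse_request(request: str, default: str = "latin1") -> Tuple[str, str]:
    """Split a BEFORE:AFTER request into its two charset names.

    Either keyword may be omitted, in which case `default` is used,
    but the colon itself is mandatory.  (Hypothetical helper, for
    illustration only.)
    """
    if ":" not in request:
        raise ValueError("the colon between BEFORE and AFTER is mandatory")
    before, after = request.split(":", 1)
    return (before or default, after or default)


print(parse_request("ibmpc:"))        # BEFORE given, AFTER defaulted
print(parse_request(":ascii"))        # BEFORE defaulted, AFTER given
print(parse_request("ebcdic:latin1")) # both given
```

Under these assumptions, a request such as `ibmpc:' reads as "from
`ibmpc' to the default code", and a bare `:' names the default code on
both sides.
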

File: recode.info,  Node: Charsets,  Next: Easy French,  Prev: Usage,  Up: Top

Character sets recognized or produced
=====================================

The possible values for charset BEFORE or charset AFTER are provided as
the keys in the following menu.

* Menu:

* applemac::                    ASCII 8-bits for Apple's Macintosh
* ascii::                       ASCII 7-bits, BS to overstrike
* bangbang::                    ASCII "bang bang", escapes are ! and !!
* cccascii::                    ASCII 8-bits as seen by Perkin Elmer
* cdcascii::                    ASCII 8-bits as seen by Control Data
* cdcnos::                      ASCII 6/12 from NOS, escapes are ^ and @
* ebcdic::                      EBCDIC with no further comments
* flat::                        ASCII without diacritics nor underline
* ibmpc::                       ASCII 8-bits for IBM's PC
* iconqnx::                     ASCII as coded on Unisys' ICON
* latex::                       ASCII with LaTeX codes
* latin1::                      ASCII extended by Latin Alphabet 1
* texte::                       ASCII with easy French conventions


File: recode.info,  Node: applemac,  Next: ascii,  Prev: Charsets,  Up: Charsets

ASCII 8-bits for Apple's Macintosh
----------------------------------

The file was obtained from, or is destined for, a Macintosh
microcomputer from Apple.  This is an eight-bit code.  The file is the
data fork only.


File: recode.info,  Node: ascii,  Next: bangbang,  Prev: applemac,  Up: Charsets

ASCII 7-bits, BS to overstrike
------------------------------

The file is straight ASCII, seven bits only.  According to the
definition of ASCII, diacritics are applied by a sequence of three
characters: the letter, one BS, the diacritic mark.  We deviate
slightly from this by exchanging the diacritic mark and the letter so
that, on a screen device, the diacritic disappears and leaves the
letter alone.  At recognition time, both methods are acceptable.

The French quotes are coded by the sequences `< BS "' or `" BS <' for
the opening quote and `> BS "' or `" BS >' for the closing quote.  This
artificial convention was inherited in straight `ascii' from habits
around `bangbang' entry, and is not well known.
But we decided to stick to it so that the `ascii' charset will not lose
French quotes.

* Menu:

* Commented ASCII::
* Octal ASCII::
* Decimal ASCII::
* Hexadecimal ASCII::


File: recode.info,  Node: Commented ASCII,  Next: Octal ASCII,  Prev: ascii,  Up: ascii

Commented ASCII
...............

     oct dec hex   name  description

     000   0   0   nul   null character
     001   1   1   soh   start of header
     002   2   2   stx   start of text
     003   3   3   etx   end of text
     004   4   4   eot   end of transmission
     005   5   5   enq   enquiry
     006   6   6   ack   acknowledge
     007   7   7   bel   bell
     010   8   8   bs    back space
     011   9   9   ht    horizontal tab
     012  10   a   nl    new line
     013  11   b   vt    vertical tab
     014  12   c   np    new page
     015  13   d   cr    carriage return
     016  14   e   so    shift out
     017  15   f   si    shift in
     020  16  10   dle   data link escape
     021  17  11   dc1   device control 1
     022  18  12   dc2   device control 2
     023  19  13   dc3   device control 3
     024  20  14   dc4   device control 4
     025  21  15   nak   negative acknowledge
     026  22  16   syn   synchronize
     027  23  17   etb   end of transmitted block
     030  24  18   can   cancel
     031  25  19   em    end of medium
     032  26  1a   sub   substitute
     033  27  1b   esc   escape
     034  28  1c   fs    file separator
     035  29  1d   gs    group separator
     036  30  1e   rs    record separator
     037  31  1f   us    unit separator
     040  32  20   sp    space
     177 127  7f   del   delete


File: recode.info,  Node: Octal ASCII,  Next: Decimal ASCII,  Prev: Commented ASCII,  Up: ascii

Octal ASCII
...........

     000 nul  020 dle  040 sp   060 0    100 @    120 P    140 `    160 p
     001 soh  021 dc1  041 !    061 1    101 A    121 Q    141 a    161 q
     002 stx  022 dc2  042 "    062 2    102 B    122 R    142 b    162 r
     003 etx  023 dc3  043 #    063 3    103 C    123 S    143 c    163 s
     004 eot  024 dc4  044 $    064 4    104 D    124 T    144 d    164 t
     005 enq  025 nak  045 %    065 5    105 E    125 U    145 e    165 u
     006 ack  026 syn  046 &    066 6    106 F    126 V    146 f    166 v
     007 bel  027 etb  047 '    067 7    107 G    127 W    147 g    167 w
     010 bs   030 can  050 (    070 8    110 H    130 X    150 h    170 x
     011 ht   031 em   051 )    071 9    111 I    131 Y    151 i    171 y
     012 nl   032 sub  052 *    072 :    112 J    132 Z    152 j    172 z
     013 vt   033 esc  053 +    073 ;    113 K    133 [    153 k    173 {
     014 np   034 fs   054 ,    074 <    114 L    134 \    154 l    174 |
     015 cr   035 gs   055 -    075 =    115 M    135 ]    155 m    175 }
     016 so   036 rs   056 .    076 >    116 N    136 ^    156 n    176 ~
     017 si   037 us   057 /    077 ?    117 O    137 _    157 o    177 del


File: recode.info,  Node: Decimal ASCII,  Next: Hexadecimal ASCII,  Prev: Octal ASCII,  Up: ascii

Decimal ASCII
.............

       0 nul   16 dle   32 sp    48 0     64 @     80 P     96 `    112 p
       1 soh   17 dc1   33 !     49 1     65 A     81 Q     97 a    113 q
       2 stx   18 dc2   34 "     50 2     66 B     82 R     98 b    114 r
       3 etx   19 dc3   35 #     51 3     67 C     83 S     99 c    115 s
       4 eot   20 dc4   36 $     52 4     68 D     84 T    100 d    116 t
       5 enq   21 nak   37 %     53 5     69 E     85 U    101 e    117 u
       6 ack   22 syn   38 &     54 6     70 F     86 V    102 f    118 v
       7 bel   23 etb   39 '     55 7     71 G     87 W    103 g    119 w
       8 bs    24 can   40 (     56 8     72 H     88 X    104 h    120 x
       9 ht    25 em    41 )     57 9     73 I     89 Y    105 i    121 y
      10 nl    26 sub   42 *     58 :     74 J     90 Z    106 j    122 z
      11 vt    27 esc   43 +     59 ;     75 K     91 [    107 k    123 {
      12 np    28 fs    44 ,     60 <     76 L     92 \    108 l    124 |
      13 cr    29 gs    45 -     61 =     77 M     93 ]    109 m    125 }
      14 so    30 rs    46 .     62 >     78 N     94 ^    110 n    126 ~
      15 si    31 us    47 /     63 ?     79 O     95 _    111 o    127 del


File: recode.info,  Node: Hexadecimal ASCII,  Prev: Decimal ASCII,  Up: ascii

Hexadecimal ASCII
.................

     00 nul  10 dle  20 sp   30 0    40 @    50 P    60 `    70 p
     01 soh  11 dc1  21 !    31 1    41 A    51 Q    61 a    71 q
     02 stx  12 dc2  22 "    32 2    42 B    52 R    62 b    72 r
     03 etx  13 dc3  23 #    33 3    43 C    53 S    63 c    73 s
     04 eot  14 dc4  24 $    34 4    44 D    54 T    64 d    74 t
     05 enq  15 nak  25 %    35 5    45 E    55 U    65 e    75 u
     06 ack  16 syn  26 &    36 6    46 F    56 V    66 f    76 v
     07 bel  17 etb  27 '    37 7    47 G    57 W    67 g    77 w
     08 bs   18 can  28 (    38 8    48 H    58 X    68 h    78 x
     09 ht   19 em   29 )    39 9    49 I    59 Y    69 i    79 y
     0a nl   1a sub  2a *    3a :    4a J    5a Z    6a j    7a z
     0b vt   1b esc  2b +    3b ;    4b K    5b [    6b k    7b {
     0c np   1c fs   2c ,    3c <    4c L    5c \    6c l    7c |
     0d cr   1d gs   2d -    3d =    4d M    5d ]    6d m    7d }
     0e so   1e rs   2e .    3e >    4e N    5e ^    6e n    7e ~
     0f si   1f us   2f /    3f ?    4f O    5f _    6f o    7f del


File: recode.info,  Node: bangbang,  Next: cccascii,  Prev: ascii,  Up: Charsets

ASCII "bang bang", escapes are ! and !!
---------------------------------------

This is the local code in use on Cybers at Universite de Montreal,
which grave and serious people there prefer to name "ASCII code
display".  This code is also known as "Bang-bang".  It is based on a
six-bit character set in which capitals, French diacritics and a few
others are coded using an `!' escape followed by a single character,
and control characters using a double `!' escape followed by a single
character.

The routines given here presume that the six-bit code is already
expressed in ASCII by the communication channel, with embedded ASCII
`!' escapes.

Here is a table showing which characters are used to encode each ASCII
character.

     000 !!@  020 !!P  040      060 0    100 @    120 !P   140 !@   160 P
     001 !!A  021 !!Q  041 !"   061 1    101 !A   121 !Q   141 A    161 Q
     002 !!B  022 !!R  042 "    062 2    102 !B   122 !R   142 B    162 R
     003 !!C  023 !!S  043 #    063 3    103 !C   123 !S   143 C    163 S
     004 !!D  024 !!T  044 $    064 4    104 !D   124 !T   144 D    164 T
     005 !!E  025 !!U  045 %    065 5    105 !E   125 !U   145 E    165 U
     006 !!F  026 !!V  046 &    066 6    106 !F   126 !V   146 F    166 V
     007 !!G  027 !!W  047 '    067 7    107 !G   127 !W   147 G    167 W
     010 !!H  030 !!X  050 (    070 8    110 !H   130 !X   150 H    170 X
     011 !!I  031 !!Y  051 )    071 9    111 !I   131 !Y   151 I    171 Y
     012 !!J  032 !!Z  052 *    072 :    112 !J   132 !Z   152 J    172 Z
     013 !!K  033 !![  053 +    073 ;    113 !K   133 [    153 K    173 ![
     014 !!L  034 !!\  054 ,    074 <    114 !L   134 \    154 L    174 !\
     015 !!M  035 !!]  055 -    075 =    115 !M   135 ]    155 M    175 !]
     016 !!N  036 !!^  056 .    076 >    116 !N   136 ^    156 N    176 !^
     017 !!O  037 !!_  057 /    077 ?    117 !O   137 _    157 O    177 !_

(ASCII 040, the space, is represented by a space.)

* Menu:

* Display Code::                Control Data's Display Code


File: recode.info,  Node: Display Code,  Prev: bangbang,  Up: bangbang

Control Data's Display Code
...........................

     Octal display code to graphic     Octal display code to octal ASCII

     00 :   20 P   40 5   60 #         00 072  20 120  40 065  60 043
     01 A   21 Q   41 6   61 [         01 101  21 121  41 066  61 133
     02 B   22 R   42 7   62 ]         02 102  22 122  42 067  62 135
     03 C   23 S   43 8   63 %         03 103  23 123  43 070  63 045
     04 D   24 T   44 9   64 "         04 104  24 124  44 071  64 042
     05 E   25 U   45 +   65 _         05 105  25 125  45 053  65 137
     06 F   26 V   46 -   66 !         06 106  26 126  46 055  66 041
     07 G   27 W   47 *   67 &         07 107  27 127  47 052  67 046
     10 H   30 X   50 /   70 '         10 110  30 130  50 057  70 047
     11 I   31 Y   51 (   71 ?         11 111  31 131  51 050  71 077
     12 J   32 Z   52 )   72 <         12 112  32 132  52 051  72 074
     13 K   33 0   53 $   73 >         13 113  33 060  53 044  73 076
     14 L   34 1   54 =   74 @         14 114  34 061  54 075  74 100
     15 M   35 2   55     75 \         15 115  35 062  55 040  75 134
     16 N   36 3   56 ,   76 ^         16 116  36 063  56 054  76 136
     17 O   37 4   57 .   77 ;         17 117  37 064  57 056  77 073

(Display code 55 is the space.)


File: recode.info,  Node: cccascii,  Next: cdcascii,  Prev: bangbang,  Up: Charsets

ASCII 8-bits as seen by Perkin Elmer
------------------------------------

This charset represents the way Concurrent Computer Corporation
(formerly Perkin Elmer) expresses EBCDIC using ASCII.


File: recode.info,  Node: cdcascii,  Next: cdcnos,  Prev: cccascii,  Up: Charsets

ASCII 8-bits as seen by Control Data
------------------------------------

This charset represents the way Control Data Corporation relates EBCDIC
to ASCII.  We also select the lower half of this table to do straight
ASCII to EBCDIC conversions, back and forth.


File: recode.info,  Node: cdcnos,  Next: ebcdic,  Prev: cdcascii,  Up: Charsets

ASCII 6/12 from NOS, escapes are ^ and @
----------------------------------------

This is one of the charsets in use on CDC Cyber NOS systems to
represent ASCII, sometimes named the "NOS 6/12" code for coding ASCII.
This code is also known as "caret ASCII".  It is based on a six-bit
character set in which small letters and control characters are coded
using a `^' escape and, sometimes, a `@' escape.

The routines given here presume that the six-bit code is already
expressed in ASCII by the communication channel, with embedded ASCII
`^' and `@' escapes.

Here is a table showing which characters are used to encode each ASCII
character.

     000 ^5   020 ^#   040      060 0    100 @A   120 P    140 @G   160 ^P
     001 ^6   021 ^[   041 !    061 1    101 A    121 Q    141 ^A   161 ^Q
     002 ^7   022 ^]   042 "    062 2    102 B    122 R    142 ^B   162 ^R
     003 ^8   023 ^%   043 #    063 3    103 C    123 S    143 ^C   163 ^S
     004 ^9   024 ^"   044 $    064 4    104 D    124 T    144 ^D   164 ^T
     005 ^+   025 ^_   045 %    065 5    105 E    125 U    145 ^E   165 ^U
     006 ^-   026 ^!   046 &    066 6    106 F    126 V    146 ^F   166 ^V
     007 ^*   027 ^&   047 '    067 7    107 G    127 W    147 ^G   167 ^W
     010 ^/   030 ^'   050 (    070 8    110 H    130 X    150 ^H   170 ^X
     011 ^(   031 ^?   051 )    071 9    111 I    131 Y    151 ^I   171 ^Y
     012 ^)   032 ^<   052 *    072 @D   112 J    132 Z    152 ^J   172 ^Z
     013 ^$   033 ^>   053 +    073 ;    113 K    133 [    153 ^K   173 ^0
     014 ^=   034 ^@   054 ,    074 <    114 L    134 \    154 ^L   174 ^1
     015 ^    035 ^\   055 -    075 =    115 M    135 ]    155 ^M   175 ^2
     016 ^,   036 ^^   056 .    076 >    116 N    136 @B   156 ^N   176 ^3
     017 ^.   037 ^;   057 /    077 ?    117 O    137 _    157 ^O   177 ^4

(ASCII 040, the space, is represented by a space; ASCII 015 is a `^'
followed by a space.)


File: recode.info,  Node: ebcdic,  Next: flat,  Prev: cdcnos,  Up: Charsets

EBCDIC with no further comments
-------------------------------

This charset is IBM's Extended Binary Coded Decimal Interchange Code.
This is an eight-bit code.


File: recode.info,  Node: flat,  Next: ibmpc,  Prev: ebcdic,  Up: Charsets

ASCII without diacritics nor underline
--------------------------------------

This code is ASCII expunged of all diacritics and underlines, as long
as they are applied using three-character sequences, with BS in the
middle.  Also, although this is slightly unrelated, each control
character is represented by a sequence of two or three graphic
characters.  The newline character, however, keeps its functionality
and is not represented.

Note that charset `flat' is a terminal charset.  We can convert *to*
`flat', but not *from* it.


File: recode.info,  Node: ibmpc,  Next: iconqnx,  Prev: flat,  Up: Charsets

ASCII 8-bits for IBM's PC
-------------------------

The file was obtained from, or is destined for, a PC microcomputer from
IBM or any compatible.  This is an eight-bit code.


File: recode.info,  Node: iconqnx,  Next: latex,  Prev: ibmpc,  Up: Charsets

ASCII for the Unisys' ICON
--------------------------

The file uses the Unisys ICON way of representing diacritics with 0x19
escape sequences.  This is a seven-bit code, even if eight-bit codes
can flow through as part of the IBM-PC charset.


File: recode.info,  Node: latex,  Next: latin1,  Prev: iconqnx,  Up: Charsets

ASCII with LaTeX codes
----------------------

This charset is an ASCII file coded to be read by LaTeX or, in certain
cases, by TeX.

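Looking back at the `flat' charset described earlier in this chapter,
the removal of overstruck diacritics and underlines can be sketched in
a few lines of Python.  This is only an illustration under stated
assumptions, not the routine used by `recode': the function name
`flatten' and the set of characters treated as marks are hypothetical,
and the representation of control characters by graphic characters is
left out because this document does not specify it.

```python
BS = "\b"

# Assumed set of marks that may be overstruck on a letter.  The actual
# list recognized by recode is not given in this document.
MARKS = set("'`^\"~,_")


def flatten(text: str) -> str:
    """Drop diacritics and underlines applied as X BS Y sequences.

    Both orders are accepted, as for the `ascii' charset:
    letter BS mark, or mark BS letter.  Only the letter is kept.
    """
    out = []
    i = 0
    while i < len(text):
        if i + 2 < len(text) and text[i + 1] == BS:
            first, third = text[i], text[i + 2]
            if first in MARKS and third.isalpha():
                out.append(third)      # mark BS letter
                i += 3
                continue
            if first.isalpha() and third in MARKS:
                out.append(first)      # letter BS mark
                i += 3
                continue
        out.append(text[i])            # anything else is copied as is
        i += 1
    return "".join(out)


print(flatten("e\b'te\b'"))    # letter-first overstrikes
print(flatten("'\bete"))       # mark-first overstrike
```

Sequences that do not pair a letter with one of the assumed marks are
copied unchanged, which matches the "as long as they are applied using
three-character sequences" restriction only loosely.
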

File: recode.info,  Node: latin1,  Next: texte,  Prev: latex,  Up: Charsets

ASCII extended by Latin Alphabet 1
----------------------------------

This charset corresponds to the ISO Latin Alphabet 1.  It is an
eight-bit code which coincides with ASCII for the lower half.

* Menu:

* Commented Latin-1::
* Octal Latin-1::
* Decimal Latin-1::
* Hexadecimal Latin-1::


File: recode.info,  Node: Commented Latin-1,  Next: Octal Latin-1,  Prev: latin1,  Up: latin1

Commented Latin-1
.................

     oct dec hex   description

     240 160  a0   no-break space
     241 161  a1   inverted exclamation mark
     242 162  a2   cent sign
     243 163  a3   pound sign
     244 164  a4   currency sign
     245 165  a5   yen sign
     246 166  a6   broken bar
     247 167  a7   paragraph sign, section sign
     250 168  a8   diaeresis
     251 169  a9   copyright sign
     252 170  aa   feminine ordinal indicator
     253 171  ab   left angle quotation mark
     254 172  ac   not sign
     255 173  ad   soft hyphen
     256 174  ae   registered trade mark sign
     257 175  af   macron
     260 176  b0   degree sign
     261 177  b1   plus-minus sign
     262 178  b2   superscript two
     263 179  b3   superscript three
     264 180  b4   acute accent
     265 181  b5   small greek mu, micro sign
     266 182  b6   pilcrow sign
     267 183  b7   middle dot
     270 184  b8   cedilla
     271 185  b9   superscript one
     272 186  ba   masculine ordinal indicator
     273 187  bb   right angle quotation mark
     274 188  bc   vulgar fraction one quarter
     275 189  bd   vulgar fraction one half
     276 190  be   vulgar fraction three quarters
     277 191  bf   inverted question mark
     300 192  c0   capital A with grave accent
     301 193  c1   capital A with acute accent
     302 194  c2   capital A with circumflex accent
     303 195  c3   capital A with tilde
     304 196  c4   capital A with diaeresis
     305 197  c5   capital A with ring above
     306 198  c6   capital diphthong A with E
     307 199  c7   capital C with cedilla
     310 200  c8   capital E with grave accent
     311 201  c9   capital E with acute accent
     312 202  ca   capital E with circumflex accent
     313 203  cb   capital E with diaeresis
     314 204  cc   capital I with grave accent
     315 205  cd   capital I with acute accent
     316 206  ce   capital I with circumflex accent
     317 207  cf   capital I with diaeresis
     320 208  d0   capital icelandic ETH
     321 209  d1   capital N with tilde
     322 210  d2   capital O with grave accent
     323 211  d3   capital O with acute accent
     324 212  d4   capital O with circumflex accent
     325 213  d5   capital O with tilde
     326 214  d6   capital O with diaeresis
     327 215  d7   multiplication sign
     330 216  d8   capital O with oblique stroke
     331 217  d9   capital U with grave accent
     332 218  da   capital U with acute accent
     333 219  db   capital U with circumflex accent
     334 220  dc   capital U with diaeresis
     335 221  dd   capital Y with acute accent
     336 222  de   capital icelandic THORN
     337 223  df   small german sharp s
     340 224  e0   small a with grave accent
     341 225  e1   small a with acute accent
     342 226  e2   small a with circumflex accent
     343 227  e3   small a with tilde
     344 228  e4   small a with diaeresis
     345 229  e5   small a with ring above
     346 230  e6   small diphthong a with e
     347 231  e7   small c with cedilla
     350 232  e8   small e with grave accent
     351 233  e9   small e with acute accent
     352 234  ea   small e with circumflex accent
     353 235  eb   small e with diaeresis
     354 236  ec   small i with grave accent
     355 237  ed   small i with acute accent
     356 238  ee   small i with circumflex accent
     357 239  ef   small i with diaeresis
     360 240  f0   small icelandic eth
     361 241  f1   small n with tilde
     362 242  f2   small o with grave accent
     363 243  f3   small o with acute accent
     364 244  f4   small o with circumflex accent
     365 245  f5   small o with tilde
     366 246  f6   small o with diaeresis
     367 247  f7   division sign
     370 248  f8   small o with oblique stroke
     371 249  f9   small u with grave accent
     372 250  fa   small u with acute accent
     373 251  fb   small u with circumflex accent
     374 252  fc   small u with diaeresis
     375 253  fd   small y with acute accent
     376 254  fe   small icelandic thorn
     377 255  ff   small y with diaeresis


File: recode.info,  Node: Octal Latin-1,  Next: Decimal Latin-1,  Prev: Commented Latin-1,  Up: latin1

Octal Latin-1
.............

     200      220      240 nsp  260 ++   300 A`   320 DD   340 a`   360 dd
     201      221      241 !!   261 +-   301 A'   321 N~   341 a'   361 n~
     202      222      242 c|   262 22   302 A^   322 O`   342 a^   362 o`
     203      223      243 ##   263 33   303 A~   323 O'   343 a~   363 o'
     204      224      244 cur  264 ''   304 A"   324 O^   344 a"   364 o^
     205      225      245 y-   265 uu   305 A+   325 O~   345 a+   365 o~
     206      226      246 ||   266 pil  306 AE   326 O"   346 ae   366 o"
     207      227      247 $$   267 ..   307 C,   327 xx   347 c,   367 //
     210      230      250 ""   270 ,,   310 E`   330 O/   350 e`   370 o/
     211      231      251 cO   271 11   311 E'   331 U`   351 e'   371 u`
     212      232      252 a-   272 o-   312 E^   332 U'   352 e^   372 u'
     213      233      253 <<   273 >>   313 E"   333 U^   353 e"   373 u^
     214      234      254 -.   274 14   314 I`   334 U"   354 i`   374 u"
     215      235      255 --   275 12   315 I'   335 Y'   355 i'   375 y'
     216      236      256 tO   276 34   316 I^   336 PP   356 i^   376 pp
     217      237      257 mac  277 ??   317 I"   337 ss   357 i"   377 y"


File: recode.info,  Node: Decimal Latin-1,  Next: Hexadecimal Latin-1,  Prev: Octal Latin-1,  Up: latin1

Decimal Latin-1
...............

     128      144      160 nsp  176 ++   192 A`   208 DD   224 a`   240 dd
     129      145      161 !!   177 +-   193 A'   209 N~   225 a'   241 n~
     130      146      162 c|   178 22   194 A^   210 O`   226 a^   242 o`
     131      147      163 ##   179 33   195 A~   211 O'   227 a~   243 o'
     132      148      164 cur  180 ''   196 A"   212 O^   228 a"   244 o^
     133      149      165 y-   181 uu   197 A+   213 O~   229 a+   245 o~
     134      150      166 ||   182 pil  198 AE   214 O"   230 ae   246 o"
     135      151      167 $$   183 ..   199 C,   215 xx   231 c,   247 //
     136      152      168 ""   184 ,,   200 E`   216 O/   232 e`   248 o/
     137      153      169 cO   185 11   201 E'   217 U`   233 e'   249 u`
     138      154      170 a-   186 o-   202 E^   218 U'   234 e^   250 u'
     139      155      171 <<   187 >>   203 E"   219 U^   235 e"   251 u^
     140      156      172 -.   188 14   204 I`   220 U"   236 i`   252 u"
     141      157      173 --   189 12   205 I'   221 Y'   237 i'   253 y'
     142      158      174 tO   190 34   206 I^   222 PP   238 i^   254 pp
     143      159      175 mac  191 ??   207 I"   223 ss   239 i"   255 y"


File: recode.info,  Node: Hexadecimal Latin-1,  Prev: Decimal Latin-1,  Up: latin1

Hexadecimal Latin-1
...................

     80      90      a0 nsp  b0 ++   c0 A`   d0 DD   e0 a`   f0 dd
     81      91      a1 !!   b1 +-   c1 A'   d1 N~   e1 a'   f1 n~
     82      92      a2 c|   b2 22   c2 A^   d2 O`   e2 a^   f2 o`
     83      93      a3 ##   b3 33   c3 A~   d3 O'   e3 a~   f3 o'
     84      94      a4 cur  b4 ''   c4 A"   d4 O^   e4 a"   f4 o^
     85      95      a5 y-   b5 uu   c5 A+   d5 O~   e5 a+   f5 o~
     86      96      a6 ||   b6 pil  c6 AE   d6 O"   e6 ae   f6 o"
     87      97      a7 $$   b7 ..   c7 C,   d7 xx   e7 c,   f7 //
     88      98      a8 ""   b8 ,,   c8 E`   d8 O/   e8 e`   f8 o/
     89      99      a9 cO   b9 11   c9 E'   d9 U`   e9 e'   f9 u`
     8a      9a      aa a-   ba o-   ca E^   da U'   ea e^   fa u'
     8b      9b      ab <<   bb >>   cb E"   db U^   eb e"   fb u^
     8c      9c      ac -.   bc 14   cc I`   dc U"   ec i`   fc u"
     8d      9d      ad --   bd 12   cd I'   dd Y'   ed i'   fd y'
     8e      9e      ae tO   be 34   ce I^   de PP   ee i^   fe pp
     8f      9f      af mac  bf ??   cf I"   df ss   ef i"   ff y"


File: recode.info,  Node: texte,  Prev: latin1,  Up: Charsets

ASCII with easy French conventions
----------------------------------

This charset is identical to `ascii', save for French diacritics, which
are noted using a slightly different convention.  See *Note Easy
French:: for more details.


File: recode.info,  Node: Easy French,  Next: Internals,  Prev: Charsets,  Up: Top

Easy French conventions
=======================

These conventions are used in the `texte' and `latexte' charsets, which
are seven-bit codes.  At text entry time, these conventions provide a
little speed-up.  At read time, they slightly improve readability.  Of
course, it would be better to have a specialized keyboard for making
direct eight-bit entries, and fonts for immediately displaying
eight-bit ISO Latin-1 characters.  But not everybody is so fortunate.
In several mailing environments, the eighth bit is often willfully
destroyed (a horrible Crime that most people do not care to straighten
up).

See:

* Menu:

* French quotes::               How to type them.
* Latin ligatures::             They are not representable.
* Diacritics::                  How to type them, things to know.
* Ending diaeresis::            List of words ending with diaeresis.
* Easy French History::         When, How and Who.


File: recode.info,  Node: French quotes,  Next: Latin ligatures,  Prev: Easy French,  Up: Easy French

French quotes
-------------

French quotes (sometimes called "angle quotes") are noted the same way
English quotes are noted in TeX, *id est* by ```' and `'''.


File: recode.info,  Node: Latin ligatures,  Next: Diacritics,  Prev: French quotes,  Up: Easy French

Latin ligatures
---------------

No effort has been made to preserve Latin ligatures (`ae', `oe'), which
are representable in several other charsets.  So, these ligatures may
be lost through Easy French conventions.


File: recode.info,  Node: Diacritics,  Next: Ending diaeresis,  Prev: Latin ligatures,  Up: Easy French

Diacritics
----------

This is almost the French convention for simplified diacritics entry:

     Acute accent
     Grave accent
     Circumflex accent
     Diaeresis
     Cedilla

In some countries, `:' is used instead of `"' to mark diaeresis.
`recode' supports only one of the two conventions in a single call,
depending on the `-c' option of the `recode' command.

The convention is prone to losing information, because the diacritic
meaning overloads some characters that already have other uses.  To
alleviate this, some knowledge of the French language is built into
the recognition routines.  So, the following subtleties are
systematically obeyed by the various recognizers.

   * A single quote which follows an `e' does not necessarily mean an
     acute accent if it is followed by a single other one.  For
     example:

          `e'' will give an `e' with an acute accent.

          `e''' will give a simple `e', with a closing quotation mark.

          `e'''' will give an `e' with an acute accent, followed by a
          closing quotation mark.

     This convention causes a problem when English citations appear
     within a French text.  In sentences like:

          There's a meeting at Archie's restaurant.

     the single quotes will be mistaken twice for acute accents.  So
     English contractions and suffix possessives could be mangled.

   * A double quote or colon, depending on the `-c' option, which
     follows a vowel is interpreted as a diaeresis only if it is
     followed by another letter.  But there are several words in French
     that *end* with a diaeresis, and the program also recognizes them.
     See *Note Ending diaeresis:: for a study of all the problematic
     cases.

   * A comma which follows a `c' is interpreted as a cedilla only if it
     is followed by one of the vowels `a', `o' and `u'.


File: recode.info,  Node: Ending diaeresis,  Next: Easy French History,  Prev: Diacritics,  Up: Easy French

List of words ending with diaeresis
-----------------------------------

Here is a classification of all cases of a diaeresis at the end of a
French word:

   * Words ending in "igue"

        - Feminine words without a relative masculine:

               besaigue"  cigue"

        - Feminine words with a relative masculine: (1)

               aigue"  ambigue"  contigue"  exigue"  subaigue"
               suraigue"

   * Words not ending in "igue"

        - Ended by "i": (2)

               ai"  congai"  goi"  hai"kai"  inoui"  sai"  samurai"
               thai"  tokai"

        - Ended by "e":

               canoe"

        - Ended by "u": (3)

               Esau"

Notes:

  1. There are supposed to be seven words in this case.  So, one is
     missing.

  2. Look at the following sentence:

          "Ai"e!  Voici le proble`me que j'ai"

     or, using the `-c' option:

          Ai:e!  Voici le proble`me que j'ai:

     There is an ambiguity between an *ai"*, the small animal, and the
     present indicative of *avoir* (first person singular), when
     followed by what could be a diaeresis mark.  Fortunately, the case
     is solved by the fact that an apostrophe always precedes the verb
     and almost never the animal.

  3. I did not pay attention to proper nouns, but this one showed up as
     being fairly evident.

Just to complete this topic, note that it would be wrong to make a rule
that all words ending in "igue" need a diaeresis.
Here are counter-examples:

     becfigue  be`sigue  bigue  bordigue  bourdigue  brigue
     contre-digue  digue  d'intrigue  fatigue  figue  garrigue  gigue
     igue  intrigue  ligue  prodigue  sarigue  zigue


File: recode.info,  Node: Easy French History,  Prev: Ending diaeresis,  Up: Easy French

When, How and Who.
------------------

Easy French has been in use in France for a while.  Loic Dachary
<loic@design.axis.fr> first exposed me to this particular convention.
I only slightly adapted it (the diaeresis option) to make it more
comfortable for several usages in Que'bec originating from Universite'
de Montre'al.  In fact, the main problem for me was not necessarily to
invent Easy French, but to recognize the "best" convention to use
("best" is not defined here) and to try to solve the main pitfalls
associated with the selected convention.  I'm particularly grateful to
Claude Goutier <6@cc.umontreal.ca> who, through numerous discussions in
August 1988, was quite helpful in evaluating various hypotheses.


File: recode.info,  Node: Internals,  Next: Future,  Prev: Easy French,  Up: Top

Internal aspects
================

This information is organized in:

* Menu:

* Main flow::                   Overall organization of the program.
* Piping::                      Distinction between internal or external piping.
* Limitations::                 A few limitations of the chosen implementation.
* New charsets::                How to proceed in adding new charsets.


File: recode.info,  Node: Main flow,  Next: Piping,  Prev: Internals,  Up: Internals

Overall organization
--------------------

The main driver has a table giving the available conversion routines
and, for each, the starting charset and the ending charset.  It then
tries to figure out the shortest sequence of conversions that will
transform the input charset into the final charset.

Let us consider these charsets as being the nodes of a directed graph.
`recode' has internally a few elementary recoding methods, called
"single-step"s, each of which may be considered as an oriented arc from
one node to another.
A cost is attributed to each single-step. Given a starting code and a
goal code, `recode' computes the most economical route through the
elementary recodings.

The main part of `recode' is written in C, as are most single-steps. A
few single-steps which need to recognize sequences of multiple
characters are written in `lex'.

File: recode.info, Node: Piping, Next: Limitations, Prev: Main flow, Up: Internals

Internal vs external piping
---------------------------

Suppose that four elementary steps are selected at path optimization
time. Then `recode' will split itself into four different tasks
interconnected with pipes, logically equivalent to:

     step1 <input | step2 | step3 | step4 >output

File: recode.info, Node: Limitations, Next: New charsets, Prev: Piping, Up: Internals

Some limitations
----------------

Here are some limitations of the program.

   * There is a limit (currently 10) on the number of steps allowed in
     a single recoding job. It should remain sufficient for quite a
     while, maybe forever. This is a simple compilation `#define', in
     any case.

File: recode.info, Node: New charsets, Prev: Limitations, Up: Internals

Adding new charsets
-------------------

It is fairly easy for a programmer to add a new charset to `recode'.
All it requires is making two routines, modifying a few tables, and
`make'ing `recode' again.

One of the routines should convert from any previous charset to the new
one. Any previous charset will do, but try to select it so you will
not lose too much information while converting. If you have to read
multiple bytes of the old charset before recognizing the character to
produce, you might write this routine in `lex'; otherwise, use C.
Prototype your routine after one of those which exist, so as to keep
the sources uniform.

The other routine should convert from the new charset to any older one.
You do not have to select the same old charset as the one you selected
for the previous routine.
Select any charset for which you will not lose too much information
while converting. If the routine has to read multiple bytes of the new
charset before deciding which character it will produce, you might
write this routine in `lex'; otherwise, use C. Prototype your routine
after one of those which exist, so as to keep the sources uniform.

Edit `Makefile' to add the object name of your two routines to the
`C_STEPS' or `L_STEPS' macro definition, depending on whether your
routines are written in C or in `lex'.

Then edit `steps.h' in the following four places:

  1. Create a symbol for your new charset in the `enum TYPE_code'
     definition.

  2. Add the option name of your new charset in the `code_keywords'
     initialization.

  3. Add two `extern' declarations for your routines at the
     appropriate places.

  4. Add two lines in the `single_steps' array initialization to
     declare your routines. For each line, include the following four
     fields:

       1. The function name of your routine.

       2. The starting code `enum' constant, that is, the code your
          routine *reads*.

       3. The goal code `enum' constant, that is, the code your routine
          *produces*.

       4. The cost of your routine, using the predefined constants
          `STEP', `LOOSE', `EXACT', `SLOW' and `FAST'. See the
          comments for the exact meaning of each of these and follow
          the examples. Respect these meanings and be honest with the
          costs!

In some circumstances, one of your routines would be a mere copy. It
is better in this case not to provide the routine, but still to declare
it in `single_steps' using `NULL' as its function name and `ALREADY'
*alone* as its cost.

File: recode.info, Node: Future, Prev: Internals, Up: Top

Future things
=============

I will be glad to hear criticism and suggestions, even about details.
This program is made up of hundreds of details, in fact. Write to
`pinard@iro.umontreal.ca'.

Some notes and suggestions.

   * Accept abbreviations for charsets on the command line. Accept
     more than one conversion with intermediate filters in a single
     call.
   * Support Universite de Montreal "accent" convention.

   * Support `[nt]roff' diacritics.

   * Support the Atari-ST internal code.

   * Segregate charsets and usages.

   * Is there some way of specifying that `recode' should not contract
     something that looks like an accent? Like "There\'s a meeting at
     Archie\'s restaurant"? (With corresponding insertion of
     backslashes or whatever when converting the other way, of course;
     the transformation from accented to ASCII should be exactly
     invertible in all cases.) Of course, `There\'s' will not be
     contracted.

Tag Table:
Node: Top
Node: Usage
Node: Charsets
Node: applemac
Node: ascii
Node: Commented ASCII 10698
Node: Octal ASCII 12314
Node: Decimal ASCII 13465
Node: Hexadecimal ASCII 14542
Node: bangbang 15575
Node: Display Code 17531
Node: cccascii 18896
Node: cdcascii 19180
Node: cdcnos 19529
Node: ebcdic 21422
Node: flat 21679
Node: ibmpc 22291
Node: iconqnx 22544
Node: latex 22865
Node: latin1 23083
Node: Commented Latin-1 23457
Node: Octal Latin-1 27856
Node: Decimal Latin-1 29048
Node: Hexadecimal Latin-1 30246
Node: texte 31334
Node: Easy French 31640
Node: French quotes 32607
Node: Latin ligatures 32876
Node: Diacritics 33198
Node: Ending diaeresis 35198
Node: Easy French History 36949
Node: Internals 37782
Node: Main flow 38219
Node: Piping 39174
Node: Limitations 39562
Node: New charsets 39971
Node: Future 42609
End Tag Table